ABSTRACT

Motivation: Aligning reads to a reference sequence is a fundamental 
step in numerous bioinformatics pipelines. As a consequence, the 
sensitivity and precision of the mapping tool, applied with certain 
parameters to certain data, can critically affect the accuracy of 
produced results (e.g., in variant calling applications). Therefore,
there has been an increasing demand for methods for comparing
mappers and for measuring the effects of their parameters.

Read simulators combined with alignment evaluation tools provide 
the most straightforward way to evaluate and compare mappers. 
Simulation of reads is accompanied by information about their 
positions in the source genome. This information is then used 
to evaluate alignments produced by the mapper. Finally, reports 
containing statistics of successful read alignments are created. 

In the absence of standards for encoding read origins, every evaluation
tool has to be made explicitly compatible with the simulator used to
generate reads.

Results: To overcome this obstacle, we have created RNF (Read
Naming Format), a generic format for assigning read names that
encode information about the original positions of reads.

Furthermore, we have developed an associated software package,
RNFtools, containing two principal components. MIShmash applies
one of several popular read simulation tools (among DwgSim, Art,
Mason, CuReSim, etc.) and transforms the generated reads into RNF
format. LAVEnder then evaluates a given read mapper using simulated
reads in RNF format. Special attention is paid to mapping qualities,
which serve for the parametrization of ROC curves, and to the
evaluation of the effect of read sample contamination.

Availability and implementation: 

RNF spec.: http://github.com/karel-brinda/rnf-spec

RNFtools: http://github.com/karel-brinda/rnftools
Contact: karel.brinda@univ-mlv.fr


1 INTRODUCTION 

The number of Next-Generation Sequencing (NGS) read mappers
has grown rapidly in recent years. Consequently, there is an
increasing demand for methods for evaluating and comparing
mappers in order to select the most appropriate one for a specific
task.

The basic approach to comparing mappers consists of simulating
NGS reads, aligning them to the reference genome, and assessing
read mapping accuracy with a tool that evaluates whether each
individual read has been aligned correctly.


There exist many read simulators (WgSim¹, DwgSim², CuReSim
(Caboche et al., 2014), Art (Huang et al., 2011), Mason
(Holtgrewe, 2010), PIRS (Xu et al., 2012), Xs (Pratas et al., 2014),
FlowSim (Balzer et al., 2010), GemSim (McElroy et al., 2012),
PbSim (Ono et al., 2013), SINC (Pattnaik et al., 2014),
FastqSim (Shcherbina, 2014)) as well as many evaluation tools
(WgSim_Eval, DwgSim_Eval, CuReSimEval, RABeMa
(Holtgrewe et al., 2011), etc.). However, each read simulator
encodes information about the origin of reads in its own manner.
This makes combining tools complicated and makes writing ad-hoc
conversion scripts inevitable.

Here we propose a standard for naming simulated NGS reads,
called Read Naming Format (RNF), that makes evaluation tools for
read mappers independent of the tool used for read simulation.

Furthermore, we introduce RNFtools, an easily configurable
software package for obtaining simulated reads in RNF format using
a wide class of existing read simulators, and also for evaluating
NGS mappers.


1.1 Simulation of reads 

A typical read simulator introduces mutations into a given reference
genome (usually provided as a FASTA file) and generates reads
as genomic substrings with randomly added sequencing errors.
Different statistical models can be employed to simulate sequencing
errors and artefacts observed in experimental reads. The models
usually take into account CG-content and the distributions of
coverage, of sequencing errors in reads, and of genomic mutations.
Simulators can often learn their parameters from an experimental
alignment file.

Finally, information about the origin of every read is encoded in
some way and the reads are saved into a FASTQ file.
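The simulation step described above can be illustrated with a minimal Python sketch. It does not reproduce the model of any particular simulator: the function name `simulate_reads`, its parameters, the uniform substitution-only error model, and the restriction to the forward strand without genomic mutations are all assumptions made for illustration. Each read is returned together with its 1-based leftmost and rightmost coordinates, which is the origin information that a simulator would subsequently encode.

```python
import random


def simulate_reads(reference, n_reads, read_len, error_rate, seed=0):
    """Toy simulator: sample forward-strand substrings and add substitutions.

    Returns a list of (sequence, left, right) triples, where left and right
    are 1-based inclusive coordinates of the read in the reference.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)  # 0-based start
        bases = list(reference[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace the base with a different nucleotide (sequencing error).
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append(("".join(bases), start + 1, start + read_len))
    return reads


reference = "ATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCCC"
for seq, left, right in simulate_reads(reference, n_reads=3, read_len=10, error_rate=0.05):
    print(seq, left, right)
```

A real simulator would additionally sample the reverse strand, insertions and deletions, and base qualities; these are omitted here to keep the origin bookkeeping visible.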

1.2 Evaluation of mappers 

Once the simulated reads have been mapped back to the reference
sequence and possibly processed by an independent post-processing
tool (remapping around indels, etc.), an evaluation tool reads the
final alignments of all reads, extracts information about their origin,
and assesses whether every single read has been aligned to a correct
location (and possibly with correct edit operations). The whole
procedure is finalized by creating a summarizing report.

Various evaluation strategies can be employed (see, e.g., the
introduction of Caboche et al. (2014)). The final statistics usually
depend strongly on the definition of a correctly mapped read and on
the mapper's approach to dealing with multi-mapped reads and with
mapping qualities.
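How much the final statistics depend on the definition of a correctly mapped read can be seen from a small Python sketch. The criterion used below (same chromosome, same direction, leftmost coordinate within a tolerance) is only one possible definition and is an assumption of this example; the function name `is_correctly_mapped` and the `tolerance` parameter are hypothetical.

```python
def is_correctly_mapped(true_origin, reported, tolerance=5):
    """Compare (chromosome_id, direction, leftmost_coordinate) triples.

    Assumed criterion: chromosome and direction must match and the leftmost
    coordinates may differ by at most `tolerance` positions.
    """
    true_chrom, true_dir, true_left = true_origin
    rep_chrom, rep_dir, rep_left = reported
    return (
        true_chrom == rep_chrom
        and true_dir == rep_dir
        and abs(true_left - rep_left) <= tolerance
    )


origin = (1, "F", 100)      # where the read was simulated from
alignment = (1, "F", 103)   # where the mapper placed it
print(is_correctly_mapped(origin, alignment, tolerance=5))
print(is_correctly_mapped(origin, alignment, tolerance=0))
```

The same alignment is accepted with a tolerance of 5 positions and rejected when an exact match is required, so two evaluation tools with different definitions would report different statistics for the same mapper.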


*To whom correspondence should be addressed


1 http://github.com/lh3/wgsim

2 http://github.com/nh13/dwgsim
  Coor      12345678901234-5678901234567890123456789

  Source 1 - reference genome
  chr 1     ATGTTAGATAA-GATAGCTGTGCTAGTAGGCAGTCAGCCC
  chr 2     ttcttctggaa-gaccttctcctcctgcaaataaa

  Source 2 - generator of random sequences

  READS:
  r001      ATG-TAGATA ->
  r002/1    TTAGATAACGA ->
  r002/2    <- TCAG-CGGG
  r003/1    tgcaaataa ->
  r003/2    gaa-gacc-t ->
  r004      ATAGCT.TCAG ->
  r005      GTAGG ->    <- agacctt    <- TCGACACG
  r006      ATATCACATCATTAGACACTA

(a) Simulated reads

  r. tuple  LRN                                                            SRN
  r001      sim__1__(1,1,F,01,10)__[single_end]                            #1
  r002      sim__2__(1,1,F,04,14),(1,1,R,31,39)__[paired_end]              #2
  r003      sim__3__(1,2,F,09,17),(1,2,F,25,33)__[mate_pair]               #3
  r004      sim__4__(1,1,F,15,36)__[spliced],C:[6=12N4=]                   #4
  r005      sim__5__(1,1,R,15,22),(1,1,F,25,29),(1,2,R,05,11)__[chimeric]  #5
  r006      rnd__6__(2,0,N,00,00)__[random]                                #6

(b) Long and short read names


Fig. 1: Example of simulated reads (in our definitions, read tuples)
and their corresponding Long Read Names and Short Read Names,
which can be used as read names in the final FASTQ file.


1.3 Existing read naming approaches 

Depending on the read simulator, information about the origin of a
read is either encoded in its name or stored in a separate file, possibly
augmented with information about the expected read alignment.
While WgSim encodes the first nucleotide of each end of the read
in the read name, DwgSim and CuReSim encode the leftmost
nucleotide of each end. Art produces SAM and ALN alignment
files, Mason creates SAM files, and PIRS writes text files in its own
format.


2 METHODS 

We have created RNF, a standard for naming simulated reads. It is designed
to be robust, easy to adopt by existing tools, extendable, and to provide
human-readable read names. We then developed a utility for generating
RNF-compliant reads using existing simulators, and an associated mapper
evaluation tool.

2.1 Read Naming Format (RNF) 

2.1.1 Read tuples. A read tuple is a tuple of sequences (possibly
overlapping) obtained from a sequencing machine from a single fragment of
DNA. Elements of these tuples are called reads. For example, every
“paired-end read” is a read tuple and both of its “ends” are individual
reads in our notation.

To every read tuple, two strings are assigned: a short read name (SRN)
and a long read name (LRN). The SRN contains a unique hexadecimal ID of
the tuple prefixed by ‘#’. The LRN consists of four parts delimited by a
double underscore: i) a prefix (possibly containing descriptive information
for the user or a particular string for sorting or randomizing the order of
tuples), ii) a unique ID, iii) information about the origins of all segments
(see below) that constitute the reads of the tuple, iv) a suffix containing
arbitrary comments or extensions (for holding additional information).

Preferred final read names are LRNs. If an LRN exceeds 255 characters (the
maximum read name length allowed in SAM), SRNs are used instead and an
SRN-LRN correspondence file must be created.
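The naming rules above can be sketched in a few lines of Python. This is an illustrative sketch and not code from RNFtools: the function names are hypothetical, segments are passed in as already formatted strings, and it is assumed here that the unique ID inside the LRN is written in the same unpadded hexadecimal form as in the SRN.

```python
MAX_NAME_LENGTH = 255  # length limit quoted in the text above


def short_read_name(read_tuple_id):
    """SRN: '#' followed by the hexadecimal ID of the read tuple."""
    return "#{:x}".format(read_tuple_id)


def long_read_name(prefix, read_tuple_id, segments, suffix=""):
    """LRN: prefix, ID, segments, and suffix joined by a double underscore."""
    return "__".join(
        [prefix, "{:x}".format(read_tuple_id), ",".join(segments), suffix]
    )


def final_read_name(prefix, read_tuple_id, segments, suffix=""):
    """Prefer the LRN; fall back to the SRN when the LRN is too long."""
    lrn = long_read_name(prefix, read_tuple_id, segments, suffix)
    if len(lrn) <= MAX_NAME_LENGTH:
        return lrn
    return short_read_name(read_tuple_id)


print(final_read_name("sim", 1, ["(1,1,F,01,10)"], "[single_end]"))
print(final_read_name("sim", 255, ["(1,1,F,01,10)"] * 30))
```

When the fallback is taken, a real tool would also have to record the SRN-LRN pair in the correspondence file; that bookkeeping is left out of the sketch.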

2.1.2 Segments. Segments are substrings of a read which are spatially 
distinct in the reference and they correspond to individual lines in a SAM 
file. Thus, each read has an associated chain of segments and we associate a 
read tuple with segments of all its reads. 

Within our definitions, a “single-end read” consists of a single read with
a single segment unless it comes from a region with genomic rearrangement.
A “paired-end read” or a “mate-pair read” consists of two reads, each with 
one segment (under the same condition). A “strobe read” consists of several 
reads. Chimeric reads (i.e., reads corresponding to a genomic fusion, a long 
deletion, or a translocation) have at least two segments. 

For each segment, the following information is encoded: leftmost and
rightmost 1-based coordinates in its reference, ID of its reference genome,
ID of the chromosome, and the direction (’F’ or ’R’). The format is:
(genome_id, chromosome_id, direction, L_coor, R_coor).

Segments in an LRN should preferably be sorted by the following keys:
source, chromosome, L_coor, R_coor, direction. When some
information is not available (e.g., the rightmost coordinate), zero is used
(‘N’ in the case of direction).
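The segment encoding and the recommended ordering can be sketched in Python as follows. The `Segment` type, the function names, and the `width` parameter for zero-padding are assumptions of this example (the padding is modelled on the two-digit coordinates used in Fig. 1); the values are taken from the chimeric read tuple r005 and the random read tuple r006 of that figure.

```python
from collections import namedtuple

Segment = namedtuple("Segment", "genome_id chromosome_id direction left right")


def format_segment(seg, width=2):
    """Render a segment as (genome_id,chromosome_id,direction,L_coor,R_coor)."""
    return "({},{},{},{:0{w}d},{:0{w}d})".format(
        seg.genome_id, seg.chromosome_id, seg.direction, seg.left, seg.right, w=width
    )


def parse_segment(text):
    """Inverse of format_segment for a single parenthesized segment."""
    genome_id, chromosome_id, direction, left, right = text.strip("()").split(",")
    return Segment(int(genome_id), int(chromosome_id), direction, int(left), int(right))


def sort_segments(segments):
    """Order by source, chromosome, leftmost, rightmost, then direction."""
    return sorted(
        segments,
        key=lambda s: (s.genome_id, s.chromosome_id, s.left, s.right, s.direction),
    )


segments = [
    Segment(1, 2, "R", 5, 11),
    Segment(1, 1, "F", 25, 29),
    Segment(1, 1, "R", 15, 22),
]
print(",".join(format_segment(s) for s in sort_segments(segments)))

# Unavailable values are encoded as zero, or 'N' for the direction.
print(format_segment(Segment(2, 0, "N", 0, 0)))
```

Keeping the numeric fields as integers until formatting makes the sort numeric rather than lexicographic, which matters once coordinates have different numbers of digits.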

2.1.3 Extensions. The basic standard can be extended for specific 
purposes by extensions (e.g., CIGAR string extension). They are part of the 
suffix and encode supplementary information. 

Examples of RNF names are shown in Figure 1.

2.2 RNFtools 

We also developed RNFtools, a software package associated with RNF.
It has two principal components: MIShmash for read simulation and
LAVEnder for evaluation of NGS read mappers.

RNFtools has been created using Snakemake (Köster and Rahmann,
2012), a Python-based Make-like build system. All employed external
programs are installed automatically when needed.

2.2.1 MIShmash read simulator. MIShmash is a pipeline for
simulating reads using existing simulators and combining the obtained sets
of reads together (e.g., in order to simulate contamination or metagenomic
samples). Its output files follow the RNF format; therefore, any
RNF-compatible evaluation tool can be used for evaluation.

2.2.2 LAVEnder evaluation tool for read mappers. LAVEnder is
a program for evaluating mappers. For a given set of BAM files, it creates an
interactive HTML report with several graphs.

In practice, the mapping qualities assigned by different mappers to a given
read are not equal (although mappers tend to unify them). Moreover, even
for a single mapper, mapping qualities are very data-specific. Therefore,
results of mappers after the same thresholding on mapping quality are not
comparable. To cope with this, we designed LAVEnder to use mapping
qualities as the parameterization of curves in ‘sensitivity-precision’ graphs
(as has been done in 0(2013)). Examples of LAVEnder output can be found
in Figure 2.
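The idea of using mapping quality as the curve parameter can be sketched in Python. This is not LAVEnder's implementation: the function name `curve_by_mapq`, the input layout, and the exact definitions (sensitivity as correctly mapped reads kept at a threshold divided by all simulated reads, precision as correctly mapped reads divided by reads kept) are assumptions made for illustration.

```python
def curve_by_mapq(alignments, total_reads):
    """Compute one (threshold, sensitivity, precision) point per MAPQ value.

    alignments: list of (mapq, is_correct) pairs for the aligned reads.
    total_reads: number of simulated reads, including unaligned ones.
    """
    points = []
    thresholds = sorted({mapq for mapq, _ in alignments}, reverse=True)
    for threshold in thresholds:
        # Keep only the alignments that pass the current quality threshold.
        kept = [ok for mapq, ok in alignments if mapq >= threshold]
        correct = sum(kept)
        points.append((threshold, correct / total_reads, correct / len(kept)))
    return points


alignments = [(60, True), (60, True), (30, True), (30, False), (0, False)]
for threshold, sensitivity, precision in curve_by_mapq(alignments, total_reads=5):
    print(threshold, sensitivity, precision)
```

Because every distinct mapping quality yields one point, two mappers with differently scaled qualities each trace their own full curve, and the curves can be compared without choosing a common threshold.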
Fig. 2: Example of two graphs produced by LAVEnder as part of a comparison of mappers' ability to detect contamination.
200,000 single-end reads were simulated from the human and mouse genomes (100,000 from HG38, 100,000 from MM10) by DwgSim
using MIShmash and mapped to HG38. All LAVEnder graphs have the false discovery rate on the x-axis and use mapping quality as the
varying parameter for the plotted curves.


3 CONCLUSION 

We designed the RNF format and propose it as a general standard for
naming simulated NGS reads.

We developed RNFtools, consisting of MIShmash, a pipeline
for read simulation, and LAVEnder, an evaluation tool
for mappers, both following the RNF convention (and thus
inter-compatible). Currently, MIShmash has a built-in interface with
the following existing read simulators: WgSim, DwgSim, Art,
and CuReSim.

We expect that authors of existing read simulators will adopt the RNF
naming convention, as it is technically simple and would allow them
to extend the usability of their software. We also expect authors of
evaluation tools to use RNF to make their tools independent of the
read simulator used.

The main benefit for users is the ease of switching between read
simulators and evaluation tools without the need to write any
special conversion scripts. Altogether, RNF simplifies the process of
evaluating NGS mappers and accelerates the debugging of tools
for processing NGS data.